BUG: Fix get dummies unicode error #22131

Scorpil · 2018-07-30T15:59:02Z

closes #22084 (pd.get_dummies incorrectly encodes unicode characters in dataframe column names)
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Scorpil · 2018-07-30T15:59:58Z

pandas/tests/reshape/test_reshape.py

+        df = pd.DataFrame({'x': [u'ä']})
+        result = pd.get_dummies(df)
+        expected = pd.DataFrame({u'x_ä': [1]}, dtype=np.uint8)
+        assert_frame_equal(result, expected)


This one would pass even without a fix, but I've included it for completeness.

More coverage is (almost) never a problem for us 🙂

codecov · 2018-07-30T17:53:33Z

Codecov Report

Merging #22131 into master will decrease coverage by <.01%.
The diff coverage is 87.5%.

@@            Coverage Diff             @@
##           master   #22131      +/-   ##
==========================================
- Coverage   92.06%   92.06%   -0.01%     
==========================================
  Files         169      169              
  Lines       50689    50693       +4     
==========================================
+ Hits        46667    46670       +3     
- Misses       4022     4023       +1

Flag         Coverage Δ
#multiple    90.47% <87.5%> (-0.01%) ⬇️
#single      42.32% <12.5%> (-0.01%) ⬇️

Impacted Files                    Coverage Δ
pandas/core/reshape/reshape.py    99.57% <87.5%> (-0.22%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 615615a...662cac3.

Scorpil · 2018-07-31T07:53:59Z

@gfyoung @jreback this PR is ready for merge, please check.

jreback · 2018-07-31T12:55:23Z

pandas/tests/reshape/test_reshape.py

+        expected = pd.DataFrame({u'x_ä': [1]}, dtype=np.uint8)
+        assert_frame_equal(result, expected)
+
+        df = pd.DataFrame({'x': ['a']})


can you parametrize this test?

jreback · 2018-07-31T13:00:39Z

pandas/core/reshape/reshape.py

-                      else '{prefix}{sep}{level}' for v in levels]
-        dummy_cols = [dummy_str.format(prefix=prefix, sep=prefix_sep, level=v)
-                      for dummy_str, v in zip(dummy_strs, levels)]
+        py2_prefix_is_unicode = isinstance(prefix, text_type)


can you make a little helper function here so that you can do this as a list-comprehension
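
A minimal sketch of what such a helper could look like, so that the column names are built in a single list comprehension. The name `make_col_name` and the local `PY2` / `text_type` stand-ins are assumptions made so the snippet runs without pandas; in the PR these flags would come from `pandas.compat`, and the merged code may differ in detail.

```python
import sys

# Stand-ins for the pandas.compat names referenced in the diff (assumed here
# so that the sketch is self-contained).
PY2 = sys.version_info[0] == 2
text_type = unicode if PY2 else str  # noqa: F821 (unicode exists only on PY2)


def make_col_name(prefix, prefix_sep, level):
    """Build one dummy column name (hypothetical helper).

    On Python 2 a byte-string format template cannot absorb non-ASCII
    unicode arguments, so the template is promoted to unicode whenever
    any of the three parts is unicode text.
    """
    fstr = '{prefix}{prefix_sep}{level}'
    if PY2 and any(isinstance(part, text_type)
                   for part in (prefix, prefix_sep, level)):
        fstr = text_type(fstr)
    return fstr.format(prefix=prefix, prefix_sep=prefix_sep, level=level)


# The helper keeps the call site down to one list comprehension.
levels = ['a', u'ä']
dummy_cols = [make_col_name(u'x', '_', level) for level in levels]
```

Checking all three parts in one place avoids separate `py2_prefix_is_unicode` / `py2_prefix_sep_is_unicode` flags, and gating on `PY2` keeps the Python 3 path a plain `str.format` call.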

jreback · 2018-07-31T13:00:57Z

pandas/core/reshape/reshape.py

@@ -923,11 +923,17 @@ def get_empty_Frame(data, sparse):

    number_of_cols = len(levels)

+    py2_prefix_sep_is_unicode = isinstance(prefix_sep, text_type)


make this explicit by also using PY2

gfyoung · 2018-08-02T08:18:42Z

pandas/tests/reshape/test_reshape.py

@@ -302,6 +302,26 @@ def test_dataframe_dummies_with_categorical(self, df, sparse, dtype):
        expected.sort_index(axis=1)
        assert_frame_equal(result, expected)

+    def test_dataframe_dummies_unicode(self):
+        df = pd.DataFrame(({u'ä': ['a']}))


Reference issue number as a comment above this line.

jreback · 2018-08-02T17:21:43Z

thanks @Scorpil !

* master: (47 commits)
  Run tests in conda build [ci skip] (pandas-dev#22190)
  TST: Check DatetimeIndex.drop on DST boundary (pandas-dev#22165)
  CI: Fix Travis failures due to lint.sh on pandas/core/strings.py (pandas-dev#22184)
  Documentation: typo fixes in MultiIndex / Advanced Indexing (pandas-dev#22179)
  DOC: added .join to 'see also' in Series.str.cat (pandas-dev#22175)
  DOC: updated Series.str.contains see also section (pandas-dev#22176)
  0.23.4 whatsnew (pandas-dev#22177)
  fix: scalar timestamp assignment (pandas-dev#19843) (pandas-dev#19973)
  BUG: Fix get dummies unicode error (pandas-dev#22131)
  Fixed py36-only syntax [ci skip] (pandas-dev#22167)
  DEPR: pd.read_table (pandas-dev#21954)
  DEPR: Removing previously deprecated datetools module (pandas-dev#6581) (pandas-dev#19119)
  BUG: Matplotlib scatter datetime (pandas-dev#22039)
  CLN: Use public method to capture UTC offsets (pandas-dev#22164)
  implement tslibs/src to make tslibs self-contained (pandas-dev#22152)
  Fix categorical from codes nan 21767 (pandas-dev#21775)
  BUG: Better handling of invalid na_option argument for groupby.rank(pandas-dev#22124) (pandas-dev#22125)
  use memoryviews instead of ndarrays (pandas-dev#22147)
  Remove depr. warning in SeriesGroupBy.count (pandas-dev#22155)
  API: Default to_* methods to compression='infer' (pandas-dev#22011)
  ...

Scorpil changed the title from "Fix get dummies unicode error" to "BUG: Fix get dummies unicode error" Jul 30, 2018

Scorpil force-pushed the fix_get_dummies_unicode_error branch from b6995f9 to 07975d6 Compare July 30, 2018 20:14

jreback requested changes Jul 31, 2018

jreback added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and 2/3 Compat labels Jul 31, 2018

gfyoung reviewed Aug 2, 2018

Scorpil added 4 commits August 2, 2018 14:40

TST: get_dummies UnicodeEncodeError tests

06b52ab

BUG: fix Unicode error in get_dummies for Python 2

a26b3c5

DOC: whatsnew entry

20589d9

CLN: parametrize test, codestyle update

15f2946

Scorpil force-pushed the fix_get_dummies_unicode_error branch from 50baa9a to 15f2946 Compare August 2, 2018 12:41

jreback added this to the 0.24.0 milestone Aug 2, 2018

jreback approved these changes Aug 2, 2018

comment

662cac3

jreback merged commit 5076ebe into pandas-dev:master Aug 2, 2018

dberenbaum pushed a commit to dberenbaum/pandas that referenced this pull request Aug 3, 2018

BUG: Fix get dummies unicode error (pandas-dev#22131)

ebc5a74

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

BUG: Fix get dummies unicode error (pandas-dev#22131)

3caed45
